1. Data Processing

Row

a. View a random sample of 20 records

Row

b. Field descriptions: the data contain 11 columns and 30,000 records; "SeriousDlqin2yrs" is the response variable

| FieldName | Description | Short |
|---|---|---|
| SeriousDlqin2yrs | Whether serious delinquency occurs within the next two years | Dlqin2yrs |
| RevolvingUtilizationOfUnsecuredLines | Ratio of unsecured line balances to total credit limits | Utilization |
| age | Borrower's age | Age |
| NumberOfTime30.59DaysPastDueNotWorse | Number of loans 30 to 59 days past due | days30_59 |
| DebtRatio | Debt ratio: debt as a share of monthly income | DebtRatio |
| MonthlyIncome | Monthly income | Income |
| NumberOfOpenCreditLinesAndLoans | Number of open loans and credit lines | OpenCredit |
| NumberOfTimes90DaysLate | Number of loans more than 90 days past due | days90 |
| NumberRealEstateLoansOrLines | Number of mortgage loans | RealEstate |
| NumberOfTime60.89DaysPastDueNotWorse | Number of loans 60 to 89 days past due | days60_89 |
| NumberOfDependents | Number of dependents | Dependents |

c. Missing values: concentrated in the Income and Dependents columns

2. Statistic Analysis

Column

Correlation matrix

Scatterplot matrix

Conditional distributions of each variable

Column

Notes


1. Missing-value imputation and outlier removal

  • Most variables are imputed with the median

  • KNN imputation would be a better choice, but it is far more computationally expensive

  • Records with DebtRatio above 100,000 or Income (monthly income) above 300,000 were removed
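The KNN alternative mentioned above can be sketched as follows. This is an illustrative pure-Python toy, not code from this report; in the report's R stack one would more likely reach for something like `VIM::kNN` (an assumption on my part — the report only loads VIM for its missing-value plot).

```python
import math

def knn_impute(rows, target_col, k=2):
    """Fill missing values (None) in target_col with the mean of that
    column among the k nearest complete rows, using Euclidean distance
    on the remaining non-missing columns. Toy version: no scaling, O(n^2)."""
    complete = [r for r in rows if r[target_col] is not None]
    for r in rows:
        if r[target_col] is not None:
            continue
        feats = [i for i in range(len(r)) if i != target_col and r[i] is not None]
        def dist(c):
            return math.sqrt(sum((r[i] - c[i]) ** 2 for i in feats))
        nearest = sorted(complete, key=dist)[:k]
        r[target_col] = sum(c[target_col] for c in nearest) / k
    return rows

# Toy data: [Income, Age]; the last record's Income is missing.
data = [[9000, 45], [5000, 30], [5200, 31], [None, 30]]
knn_impute(data, target_col=0, k=2)
print(data[3][0])  # -> 5100.0, the mean Income of the two rows nearest in Age
```

The cost the report complains about is visible here: every missing cell requires a distance to every complete row, which is what makes KNN imputation slow on 30,000 records compared with a single median.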



2. Handling class imbalance

  • The data contain 1,962 default (delinquent) records, a proportion of 0.0654131

  • Undersampling was used to balance the classes
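The undersampling step — keep every default and draw a comparable number of non-defaults — can be sketched as follows (illustrative Python mirroring the report's R code, which keeps all `Dlqin2yrs == 1` rows and samples 2,100 of the rest):

```python
import random

def undersample(records, label_key, n_majority, seed=77):
    """Keep every minority-class (label 1) record, draw n_majority
    majority-class (label 0) records without replacement, then shuffle."""
    minority = [r for r in records if r[label_key] == 1]
    majority = [r for r in records if r[label_key] == 0]
    rng = random.Random(seed)
    out = minority + rng.sample(majority, n_majority)
    rng.shuffle(out)
    return out

# Toy: 3 defaults among 97 non-defaults -> a roughly balanced subset.
recs = [{"y": 1}] * 3 + [{"y": 0}] * 97
balanced = undersample(recs, "y", n_majority=4)
print(len(balanced), sum(r["y"] for r in balanced))  # -> 7 3
```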



3. These three charts

  • show the correlations among the predictor variables and their relationships with the response variable

3. Logistic Regression

Column

Initial regression

logi.model <- glm(Dlqin2yrs~.,data = ntrain,family = "binomial")  
summary(logi.model)

Call:
glm(formula = Dlqin2yrs ~ ., family = "binomial", data = ntrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.8920  -0.7733  -0.4385   0.8624   2.1964  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.620e-01  1.897e-01  -3.490 0.000483 ***
Utilization  1.838e+00  1.227e-01  14.983  < 2e-16 ***
Age         -2.261e-02  3.295e-03  -6.863 6.74e-12 ***
days30_59    5.712e-01  5.783e-02   9.877  < 2e-16 ***
DebtRatio   -4.978e-05  3.607e-05  -1.380 0.167596    
Income      -2.781e-05  1.054e-05  -2.638 0.008351 ** 
OpenCredit   3.852e-02  8.921e-03   4.319 1.57e-05 ***
days90       1.005e+00  1.110e-01   9.052  < 2e-16 ***
RealEstate   1.003e-01  4.091e-02   2.452 0.014224 *  
days60_89    1.368e+00  1.506e-01   9.083  < 2e-16 ***
Dependents   4.362e-02  3.874e-02   1.126 0.260235    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4501.7  on 3249  degrees of freedom
Residual deviance: 3233.1  on 3239  degrees of freedom
AIC: 3255.1

Number of Fisher Scoring iterations: 6
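One way to read these estimates is as log odds ratios: exp(β) is the multiplicative change in the odds of delinquency per one-unit increase in a predictor, holding the others fixed. A short Python check using the estimates printed above (illustrative only, not part of the report's pipeline):

```python
import math

# Coefficient estimates taken from the glm summary above.
coefs = {"Utilization": 1.838, "Age": -0.02261, "days90": 1.005}

# exp(beta) = odds ratio per one-unit increase in the predictor.
for name, beta in coefs.items():
    print(f"{name}: odds ratio = {math.exp(beta):.3f}")
# Utilization's odds ratio of about 6.28 says a fully drawn credit line is
# associated with roughly 6x the odds of delinquency; Age's ratio below 1
# says each extra year of age slightly lowers the odds.
```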

Adding interaction terms and stepwise regression


inter.logi <- glm(Dlqin2yrs~(.)^2, data = ntrain, family = "binomial")
step.logi <- step(inter.logi, trace = 0)
Estimate Std. Error z value Pr(>|z|) mark
(Intercept) -0.74861 0.36179 -2.06917 0.03853 *
Utilization 2.17237 0.14513 14.96837 0.00000 **
Age -0.02181 0.00719 -3.03386 0.00241 **
days30_59 0.92860 0.16014 5.79870 0.00000 **
DebtRatio -0.00012 0.00005 -2.23367 0.02550 *
Income 0.00005 0.00005 0.97456 0.32978
OpenCredit -0.07447 0.04030 -1.84788 0.06462
days90 1.85057 0.23538 7.86212 0.00000 **
RealEstate 0.33620 0.19470 1.72678 0.08421
days60_89 1.02719 0.53280 1.92792 0.05386 *
Dependents 0.25361 0.18827 1.34704 0.17797
Utilization:days30_59 -0.46559 0.16032 -2.90413 0.00368 **
Utilization:days90 -1.01120 0.29666 -3.40857 0.00065 **
Utilization:days60_89 -0.67634 0.39065 -1.73133 0.08339
Age:Income 0.00000 0.00000 -1.70176 0.08880
Age:OpenCredit 0.00194 0.00073 2.64573 0.00815 **
Age:RealEstate -0.00666 0.00347 -1.92082 0.05475 *
Age:days60_89 0.01774 0.00987 1.79718 0.07231
Age:Dependents -0.00623 0.00417 -1.49330 0.13536
days30_59:DebtRatio 0.00012 0.00008 1.55256 0.12053
days30_59:Income 0.00002 0.00001 1.24395 0.21352
days30_59:OpenCredit -0.01668 0.01010 -1.65085 0.09877
days30_59:days90 -0.20685 0.06207 -3.33236 0.00086 **
days30_59:days60_89 -0.33258 0.07918 -4.20052 0.00003 **
OpenCredit:RealEstate 0.01083 0.00566 1.91496 0.05550 *
OpenCredit:Dependents 0.01594 0.00839 1.89923 0.05753 *
RealEstate:days60_89 0.31986 0.16359 1.95522 0.05056 *
RealEstate:Dependents -0.05112 0.03473 -1.47204 0.14101

Column

Effects of the initial regression

Effects of the stepwise regression

4. Evaluation and Comparing

Row

a. ROC curve of the logistic regression, AUC = 0.8593598

b. Random forest variable importance

Notes


  • This logistic regression is the model with interaction terms after stepwise variable selection

  • Variable importance can serve as a reference for variable selection and for assigning weights

  • The random forest measures importance by mean decrease in accuracy and by the Gini index

  • It ranks Utilization / Age / days30_59 / days90 as the most important variables

Row

c. XGBoost variable importance

d. Comparison of the four models

Notes


  • Here logistic regression is compared with three (ensemble) tree models

  • Variable importance was also produced for XGBoost; it ranks Utilization / days90 / days30_59 / days60_89 as the most important variables

  • The AUC values of the four models, from highest to lowest, are:

| Model | AUC |
|:-----:|:---------:|
| GBM | 0.8695062 |
| RF | 0.8661534 |
| XGB | 0.8624757 |
| LR | 0.8593598 |
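As a reminder of what these AUC values measure: AUC equals the probability that a randomly chosen default receives a higher score than a randomly chosen non-default (the Mann–Whitney formulation). A minimal Python sketch of that definition — illustrative, not the report's ROCR/cvAUC code:

```python
def auc(scores, labels):
    """AUC via the rank/Mann-Whitney formulation: the fraction of
    (positive, negative) pairs the score orders correctly, ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.2, 0.6]
labels = [1,   1,   0,   0,   1]
print(auc(scores, labels))  # -> 1.0: every default scored above every non-default
```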

5. WOE Transformation

Column

Variable binning

Dlqin2yrs Utilization Age days30_59 DebtRatio Income OpenCredit days90 RealEstate days60_89 Dependents
123909 1 3 1 4 3 4 4 1 2 0 1
62200 0 6 3 3 2 4 3 0 1 0 2
111351 1 6 1 5 2 3 3 0 1 0 0
117359 0 2 3 0 1 4 3 0 2 0 0
11221 1 6 4 1 4 1 4 0 1 0 0
89607 0 2 3 1 3 1 3 0 2 0 0
34652 1 6 3 0 4 4 3 0 5 0 0
111292 0 4 2 0 2 2 4 0 1 0 0
118978 0 6 3 0 4 2 1 0 0 1 0
83673 1 1 4 2 4 2 4 0 2 1 0
84537 0 4 1 1 1 3 1 0 0 0 4
76702 0 2 4 0 4 2 3 0 1 0 0
27800 0 1 4 0 1 3 1 0 0 0 0
31192 1 6 4 1 3 1 1 0 0 0 0
78356 1 6 1 0 1 2 1 0 0 0 0
19584 0 2 3 1 3 4 4 0 5 0 1
54580 0 5 4 0 2 4 3 0 4 0 0
81326 0 1 4 0 2 4 3 0 2 0 1
102688 1 6 2 0 3 2 1 4 1 3 2
145125 0 1 1 0 2 2 2 0 2 0 1
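The bin indices above come from entropy-based cut points (`discretization::cutPoints` in the report's code) with a quantile fallback when too few cut points are found. The fallback idea can be sketched in Python — illustrative only, using simple order statistics rather than R's interpolated `quantile()`:

```python
def quartile_breaks(values):
    """Breakpoints at the min, quartiles and max -- the report's
    quantile(x, c(0, 1/4, 2/4, 3/4, 1)) fallback -- de-duplicated.
    Order-statistic approximation of R's interpolated quantile()."""
    xs = sorted(values)
    qs = [xs[0], xs[len(xs) // 4], xs[len(xs) // 2], xs[3 * len(xs) // 4], xs[-1]]
    return sorted(set(qs))

def assign_bin(x, breaks):
    """1-based bin index; values beyond the last break fall in the top bin."""
    for i in range(1, len(breaks)):
        if x <= breaks[i]:
            return i
    return len(breaks) - 1

incomes = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
breaks = quartile_breaks(incomes)
print(breaks)                                     # -> [1000, 3000, 5000, 7000, 8000]
print([assign_bin(x, breaks) for x in [1500, 4500, 9999]])  # -> [1, 2, 4]
```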

WOE transformation

Dependents days60_89 RealEstate days90 OpenCredit Income DebtRatio days30_59 Age Utilization Dlqin2yrs
4057 0.0000 -0.2939 -0.2151 -0.3663 -0.1762 -0.2912 -0.1853 -0.5232 0.4455 -0.5936 0
1521 -0.1359 1.8083 -0.2151 3.0681 -0.1762 -0.0688 0.0040 -0.5232 -0.7340 1.1283 1
3684 0.2636 -0.2939 -0.1446 -0.3663 -0.1593 -0.2912 -0.1853 -0.5232 -0.0513 -0.5936 0
1442 -0.1359 1.8083 -0.1446 -0.3663 -0.1593 0.4154 0.0040 -0.5232 -0.0513 1.1283 1
1002 -0.1359 -0.2939 0.1973 -0.3663 0.2683 -0.2912 -0.0640 -0.5232 -0.7340 -1.5180 0
2828 0.1842 -0.2939 -0.1446 -0.3663 -0.1593 -0.2912 0.2402 -0.5232 -0.0513 1.1283 0
1047 -0.1359 -0.2939 -0.2151 1.7977 0.2683 -0.2912 -0.1853 -0.5232 -0.0513 0.2200 0
1524 -0.1359 -0.2939 -0.1446 -0.3663 -0.1593 -0.0470 0.0040 -0.5232 -0.0513 -0.8725 1
1760 -0.1359 -0.2939 -0.2151 -0.3663 -0.1593 -0.0688 0.2402 -0.5232 0.4455 -0.5936 1
3141 0.0685 -0.2939 0.1973 -0.3663 -0.1762 -0.0688 0.0040 -0.5232 0.2453 1.1283 1
3287 0.0685 -0.2939 0.1973 -0.3663 -0.1762 -0.0688 0.0040 -0.5232 0.2453 -0.5936 0
256 -0.1359 -0.2939 0.1973 -0.3663 -0.1762 0.4154 -0.0640 2.4277 0.4455 -0.5936 0
4028 0.7259 -0.2939 0.1973 2.9704 -0.1762 -0.0470 -0.0640 0.9474 0.2453 0.2200 1
3948 0.7259 1.8083 -0.2151 1.7977 -0.1762 0.4154 0.0040 -0.5232 0.2453 1.1283 1
4007 0.7259 -0.2939 -0.1446 -0.3663 -0.1593 -0.2912 -0.1853 -0.5232 -0.7340 -0.5936 0
3263 0.0685 -0.2939 -0.2151 -0.3663 -0.1593 -0.0688 0.2402 -0.5232 0.2453 0.2200 0
332 -0.1359 -0.2939 0.1973 -0.3663 0.2683 0.4154 -0.1853 -0.5232 0.4455 -1.5180 1
1030 -0.1359 -0.2939 -0.2151 -0.3663 0.0634 -0.0688 0.0040 0.9474 -0.7340 -1.5180 0
489 -0.1359 -0.2939 0.1973 2.9704 0.2683 -0.0688 0.0040 1.4214 0.2453 -1.5180 1
3377 0.0685 -0.2939 -0.1446 -0.3663 0.0634 -0.2912 -0.1853 -0.5232 0.2453 -1.5180 0
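Each cell above is the WOE of the bin that record falls into. Following the report's convention, WOE is log(GOOD% / BAD%) with Good = 1 (delinquent), i.e. the log ratio of the share of delinquents falling in the bin to the share of on-time payers, with the denominator floored at 0.001. A toy Python sketch of the same computation (illustrative counts, not the report's data):

```python
import math

def woe_table(bin_labels, y, floor=0.001):
    """WOE per bin: log of (share of class-1 records in the bin) over
    (share of class-0 records in the bin), class-0 share floored like
    the report floors BAD% at 0.001."""
    n1 = sum(y)
    n0 = len(y) - n1
    out = {}
    for b in sorted(set(bin_labels)):
        in_b = [yi for bl, yi in zip(bin_labels, y) if bl == b]
        good_pct = sum(in_b) / n1                       # share of delinquents here
        bad_pct = max((len(in_b) - sum(in_b)) / n0, floor)
        out[b] = math.log(good_pct / bad_pct)
    return out

# Toy: bin 2 holds most of the delinquents, so its WOE is positive.
bins = [1, 1, 1, 2, 2, 2, 1, 2]
y    = [0, 0, 1, 1, 1, 0, 0, 1]
print(woe_table(bins, y))
```

A positive WOE therefore flags a higher-risk bin under this encoding, which is why the riskiest Utilization bins above carry WOE around 1.13.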

Column

WOE values for each variable

6. Score Card

Column

Standard scorecard model

  1. The logistic regression coefficients are:
-0.0290004, 0.6745775, 0.5690218, 0.6982005, 0.5903734, 0.2173597, 0.3615053, 1.0695925, 0.6253708, 0.3927561, 0.6824976


  2. Recall that the general form of a logistic regression equation is: \[ \log(odds)=\log(\frac{p}{1-p})=\beta_0+\beta_1*x_1+\beta_2*x_2+...+\beta_9*x_9+\beta_{10}*x_{10} \]

  3. In our analysis, the corresponding equation is: \[ \log(odds)=\log(\frac{p}{1-p})=-0.0290+0.6746*Dependents+0.5690*days60.89+0.6982*RealEstate+0.5904*days90+\\ 0.2174*OpenCredit+0.3615*Income+1.0696*DebtRatio+0.6254*days30.59+0.3928*Age+0.6825*Utilization \] Note that each variable name here stands for that variable's WOE value, and \[p=P_{(Dlqin2yrs=1)}\]

  4. After the WOE transformation, the equation takes the form: \[ \log(odds)=\log(\frac{p}{1-p})=\beta_0+\\ \beta_1*(w_{11}*\delta_{11}+w_{12}*\delta_{12}+...)+\\ \beta_2*(w_{21}*\delta_{21}+w_{22}*\delta_{22}+...)+...+\\ \beta_{10}*(w_{10,1}*\delta_{10,1}+w_{10,2}*\delta_{10,2}+...) \]

  5. The score assigned to the j-th level of variable xi is therefore βi*wij. Here δ is a binary indicator taking only the values 0 and 1; within each xi, a record has exactly one level equal to 1 (all others 0), and δij = 1 means the record takes the j-th level of xi. This equation is known as the standard scorecard model.

Column

Scorecard creation and implementation

  1. Scale transformation formula:
\[{\rm Score} = A-B*\log(odds)\]
Common assumptions: (1) a default-to-normal odds of 1:15 corresponds to 600 points; (2) each doubling of the odds deducts 20 points. Solving with the table below gives A = 521.87, B = 28.85.
| Score | Odds (default : normal) | Default probability p |
|:-----:|:-----------------------:|:---------------------:|
| 700 | 1 : 480 | 0.0020790 |
| 680 | 1 : 240 | 0.0041494 |
| 660 | 1 : 120 | 0.0082645 |
| 640 | 1 : 60 | 0.0163934 |
| 620 | 1 : 30 | 0.0322581 |
| 600 | 1 : 15 | 0.0625000 |
| 580 | 1 : 7.5 | 0.1176471 |
| 560 | 1 : 3.75 | 0.2105263 |
| 540 | 1 : 1.875 | 0.3478261 |
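The two anchor assumptions pin down A and B directly: doubling the odds costs 20 points, so B = 20/ln 2, and the 600-point anchor then gives A. A quick Python check (exact A rounds to 521.86 at two decimals; the report rounds it to 521.87):

```python
import math

# Score = A - B * log(odds). Anchors: odds 1:15 -> 600 points,
# and each doubling of the odds costs 20 points.
B = 20 / math.log(2)              # "points to double the odds", ~28.85
A = 600 + B * math.log(1 / 15)    # solve 600 = A - B*log(1/15)
print(round(A, 2), round(B, 2))

# Reproduce one row of the table: odds 1:480 should map to about 700 points.
print(round(A - B * math.log(1 / 480)))  # -> 700
```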


  2. Combining the logistic regression equation with the scale formula yields the score expression:
\[ {\rm Score} = A-B\beta_0-\\ B\beta_1*(w_{11}*\delta_{11}+w_{12}*\delta_{12}+...)-\\ B\beta_2*(w_{21}*\delta_{21}+w_{22}*\delta_{22}+...)-...-\\ B\beta_{10}*(w_{10,1}*\delta_{10,1}+w_{10,2}*\delta_{10,2}+...) \]
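To make the expression concrete, here is a sketch that scores one hypothetical record. The β and WOE numbers are taken from the coefficient list and WOE tables above, but the particular combination of bins is invented for illustration, and only three of the ten variables are shown:

```python
A, B = 521.87, 28.85          # scale constants derived above
beta0 = -0.029                # intercept of the WOE-space logistic model

# Hypothetical record: per-variable (beta, WOE-of-its-bin) pairs.
record = {
    "Utilization": (0.6825, 1.1283),
    "Age":         (0.3928, -0.0513),
    "days90":      (0.5904, -0.3663),
}

log_odds = beta0 + sum(b * w for b, w in record.values())
score = A - B * log_odds

# Equivalently: a base score A - B*beta0 plus a -B*beta*woe contribution
# per variable -- these contributions are the per-level scores on the card.
contribs = {k: -B * b * w for k, (b, w) in record.items()}
print(round(score, 1), round(A - B * beta0 + sum(contribs.values()), 1))
```

The per-variable terms −B·βi·wij are exactly what the report's final-scorecard chunk computes with `-28.85 * WOE * beta`.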

Final scorecard


Application to the Shaoxing project


| Bank delinquent borrowers | Drug-related key persons | Property-crime key persons |
|---|---|---|
| Whether delinquency will occur within the next two years (y) | Whether drug use will occur within the next XX period (y) | Whether property crime will occur within the next XX period (y) |
| Ratio of unsecured lines to credit limits | Age | Age |
| Borrower's age | Occupation | Occupation |
| Number of loans 30 to 59 days past due | Marital status | Marital status |
| Debt ratio: debt as a share of monthly income | Education level | Education level |
| Monthly income | Entry/exit (border-crossing) records | Prior criminal record |
| Number of open loans and credit lines | Prior drug offense | Stayed in the same hotel as an ex-offender |
| Number of loans more than 90 days past due | Other prior offenses | Took the same flight as an ex-offender |
| Number of mortgage loans | Abnormal check-in/check-out times | Abnormal check-in/check-out times |
| Number of loans 60 to 89 days past due | Abnormal internet-usage hours | Abnormal internet-usage hours |
| Number of dependents | Stayed in the same hotel as a drug offender | Local resident frequently staying in local hotels |
| ··· | Took the same flight as a drug offender | Number of past street police stops |
| ··· | Stayed in the same hotel as another ex-offender | ··· |
| ··· | Took the same flight as another ex-offender | ··· |
| ··· | Local resident frequently staying in local hotels | ··· |
| ··· | ··· | ··· |
---
title: "Scorecard Construction"
author: "GuQuQ"
output: 
  flexdashboard::flex_dashboard:
    social: menu
    source_code: embed
---

```{r}
knitr::opts_chunk$set(message = F, warning = F,cache = T)
```

```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(dplyr)
library(ggplot2)
library(caret)
library(ROCR)
library(cvAUC)
library(xgboost)
library(plotly)
library(DT)
library(effects)#effects
library(reshape2)#melt
library(psych)#pairs.panels
library(corrplot)#corrplot
library(randomForest)
library(gbm)
```



1. Data Processing {data-orientation=rows}
=======================================================================

Row {data-height=600}
-----------------------------------------------------------------------

### a. View a random sample of 20 records

```{r}
df0<-read.csv("C:\\Users\\guquan\\Desktop\\评分卡制作demo\\cs-training.csv")
set.seed(66)
slct=createDataPartition(df0$SeriousDlqin2yrs,p = 0.2,list = F)
df=df0[slct,]

df=df[,-1]
a=names(df)
b=strsplit(x = '未来两年内是否会发生逾期
无担保贷款占信贷限额比率
贷款人年龄
逾期 30 到 59 天的贷款数目
债务比率:债务占月收入比率
月收入
开放贷款和信用贷款的数目
逾期超过 90 天的的贷款数目
抵押贷款数目
逾期 60 到 89 天的贷款数目
家庭成员数量',split = '\n')[[1]]
c <- c("Dlqin2yrs","Utilization","Age","days30_59","DebtRatio","Income",
       "OpenCredit","days90","RealEstate","days60_89","Dependents")
colnames(df) <- c
datatable(sample_n(df,20))
#str(df)
```



Row {data-height=400}
-----------------------------------------------------------------------

### b. Field descriptions: the data contain 11 columns and 30,000 records; "SeriousDlqin2yrs" is the response variable

```{r}
kable(data.frame(FieldName =a,Description=b,Short=c),
      format = "markdown", align = "c")
```


### c. Missing values: concentrated in the Income and Dependents columns
```{r}
library(VIM)
aggr(df,numbers=T)
#matrixplot(df)
```


```{r}
### Outlier and missing-value handling

df$"Utilization"[df$"Utilization">1] <- median(df$"Utilization",na.rm = TRUE)
df$"days30_59"[df$"days30_59">=96]<-0
df<-df[-which(df$DebtRatio>100000),]
df$"Income"[is.na(df$"Income")]<-median(df$"Income",na.rm = TRUE)
df<-df[-which(df$"Income">300000),]
df$"days90"[df$"days90">90]<-0
df<-df[!(df$"RealEstate"==54),]
df$"days60_89"[df$"days60_89">90]<-0
df$"Dependents"[is.na(df$"Dependents")]<-0

```


```{r}
### Class-imbalance handling (undersampling)

#mean(df$"Dlqin2yrs")#0.065
#sum(df$"Dlqin2yrs"==1)
newdf<-df[df$"Dlqin2yrs"==1,]
DownsampleDat<-df[df$"Dlqin2yrs"==0,]
set.seed(77)
downsam<-sample(1:nrow(DownsampleDat),2100)

nDat<-rbind(newdf,DownsampleDat[downsam,])
set.seed(88)
nDat<-nDat[sample(nrow(nDat)),]
nDat$"Dlqin2yrs"<-as.factor(nDat$"Dlqin2yrs")

set.seed(36)
trainIndex <- createDataPartition(nDat$Dlqin2yrs, p = .8, list = FALSE)
ntrain<-nDat[trainIndex,]
ntest<-nDat[-trainIndex,]

```


2. Statistic Analysis
=======================================================================

Column {.tabset}
-----------------------------------------------------------------------


### Correlation matrix
```{r dev = "svg"}
M=cor(nDat[,-1])
corrplot(M, method="pie",tl.cex = 0.5,tl.pos="d",cl.pos = "b")
```


### Scatterplot matrix
```{r}
pairs.panels(nDat[,-1],cex.cor = 0.8, cex = 0.1)
```


### Conditional distributions of each variable
```{r dev = "svg",eval=F}
# melting the data frame
feature.names<-names(nDat)[-1]

vizDat<- melt(nDat,id.vars = 'Dlqin2yrs'
              ,measure.vars = feature.names, variable.name = "Feature"
              ,value.name = "Value")

# conditional box plots for each feature on the response variable
p <- ggplot(data = vizDat, aes(x=Feature, y=Value)) + geom_boxplot(aes(fill=Dlqin2yrs),outlier.size = 0.3)
p <- p+ theme(axis.text.x = element_text(color="white"))+ facet_wrap( ~ Feature, scales="free")
p <- p + ggtitle("Conditional Distributions of each variable")
p
```
Column {data-width=300}
-----------------------------------------------------------------------

### Notes

1. Missing-value imputation and outlier removal
- Most variables are imputed with the median
- KNN imputation would be a better choice, but it is far more computationally expensive
- Records with DebtRatio above 100,000 or Income (monthly income) above 300,000 were removed

2. Handling class imbalance
- The data contain `r sum(df$"Dlqin2yrs"==1)` default (delinquent) records, a proportion of `r mean(df$"Dlqin2yrs")`
- Undersampling was used to balance the classes

3. These three charts
- show the correlations among the predictor variables and their relationships with the response variable
3. Logistic Regression
=======================================================================

Column {.tabset}
-----------------------------------------------------------------------

### Initial regression
```{r}
logi.model <- glm(Dlqin2yrs~., data = ntrain, family = "binomial")
summary(logi.model)
```

### Adding interaction terms and stepwise regression
```{r}
inter.logi <- glm(Dlqin2yrs~(.)^2, data = ntrain, family = "binomial")
step.logi <- step(inter.logi, trace = 0)
aa = as.data.frame(round(summary(step.logi)$coefficients, 5))
aa$mark = ifelse(aa$`Pr(>|z|)` > 0.06, " ", ifelse(aa$`Pr(>|z|)` > 0.02, "*", "**"))
kable(aa, format = "markdown", align = "c")
```

Column {.tabset}
-----------------------------------------------------------------------

### Effects of the initial regression

```{r dev = "png", fig.width=8, fig.height=7}
plot(allEffects(logi.model), cex=0.2, main=NULL)
```

### Effects of the stepwise regression

```{r dev = "png", fig.align="center", fig.width=8, fig.height=6}
plot(effect("Utilization:days90", step.logi), cex=0.5)
plot(effect("days30_59:days60_89", step.logi), cex=0.5)
plot(effect("Age:OpenCredit", step.logi), cex=0.5)
plot(effect("RealEstate:days60_89", step.logi), cex=0.5)
```

4. Evaluation and Comparing {data-orientation=rows}
=======================================================================

Row
-------------------------------------

```{r}
pred.step.logi <- predict(step.logi, ntest[,-1], type='response')
out.step <- cvAUC(pred.step.logi, ntest$Dlqin2yrs)
```

### a. ROC curve of the logistic regression, AUC = `r out.step$cvAUC`

```{r}
FPR=out.step$perf@x.values[[1]]; TPR=out.step$perf@y.values[[1]]
p <- ggplot(data = NULL, mapping = aes(FPR, TPR)) +
  geom_line(colour="red") +
  xlab(out.step$perf@x.name) + ylab(out.step$perf@y.name)
p <- p + geom_line(mapping = aes(x=seq(0,1,by=0.01), y=seq(0,1,by=0.01)),
                   colour="lightblue", linetype=2)
ggplotly(p)
```

### b. Random forest variable importance

```{r}
# randomForest
set.seed(123)
ml.forest <- randomForest(Dlqin2yrs~., data = ntrain, mtry=2,
                          importance=TRUE, ntree=1000)
pred.forest <- predict(ml.forest, ntest[,-1], 'prob')[,2]
pr.forest <- prediction(pred.forest, ntest$Dlqin2yrs)
prf.forest <- performance(pr.forest, measure = "tpr", x.measure = "fpr")
par(bg="#F0F0F0")
cl = c("#009393","#2894FF","sienna3")
varImpPlot(ml.forest, color=cl, bg=cl, main="RandomForest Var-Importance")
par(bg="white")
```

### Notes
- This logistic regression is the model with interaction terms after stepwise variable selection
- Variable importance can serve as a reference for variable selection and for assigning weights
- The random forest measures importance by mean decrease in accuracy and by the Gini index
- It ranks **Utilization / Age / days30_59 / days90** as the most important variables
Row
-------------------------------------

### c. XGBoost variable importance

```{r}
# gbm
set.seed(234)
ml.gbm <- gbm.fit(x = ntrain[,-1], y = as.numeric(ntrain[,1])-1,
                  n.trees = 300, interaction.depth = 3,
                  var.monotone = c(0,0,1,1,0,0,0,0,0,0),
                  shrinkage = 0.01, n.minobsinnode = 10,
                  nTrain = nrow(ntrain)*0.8, verbose = 0)
pred.gbm <- predict(ml.gbm, ntest[,-1], n.trees = 300)
pr.gbm <- prediction(pred.gbm, ntest$Dlqin2yrs)
prf.gbm <- performance(pr.gbm, measure = "tpr", x.measure = "fpr")

# xgboost
set.seed(456)
dtrain <- xgb.DMatrix(data = as.matrix(ntrain[,-1]),
                      label = as.numeric(ntrain$Dlqin2yrs)-1)
dtest <- xgb.DMatrix(data = as.matrix(ntest[,-1]),
                     label = as.numeric(ntest$Dlqin2yrs)-1)
ml.bst <- xgb.train(data=dtrain, max.depth=3, eta=0.01, nthread = 2, nround=100,
                    eval.metric = "error", eval.metric = "logloss",
                    verbose = 0, objective = "binary:logistic")
xgb.imp <- xgb.importance(model = ml.bst)
xgb.imp$Feature = names(ntrain)[-1][as.numeric(xgb.imp$Feature)+1]
#xgb.plot.importance(importance_matrix = xgb.imp)
xgb.imp$Feature = factor(xgb.imp$Feature,
                         levels = xgb.imp$Feature[order(xgb.imp$Gain, decreasing = T)])
p <- ggplot(data = xgb.imp, mapping = aes(x=Feature, y=Gain, fill=Feature)) +
  geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45))
ggplotly(p)
pred.xgb <- predict(ml.bst, dtest)
pr.xgb <- prediction(pred.xgb, ntest$Dlqin2yrs)
prf.xgb <- performance(pr.xgb, measure = "tpr", x.measure = "fpr")
```

### d. Comparison of the four models

```{r}
fpr.rf=prf.forest@x.values[[1]]; tpr.rf=prf.forest@y.values[[1]]
fpr.gbm=prf.gbm@x.values[[1]]; tpr.gbm=prf.gbm@y.values[[1]]
fpr.xgb=prf.xgb@x.values[[1]]; tpr.xgb=prf.xgb@y.values[[1]]
lens=sapply(list(FPR,fpr.rf,fpr.gbm,fpr.xgb), length)
roc <- data.frame(Model=rep(c("LR","RF","GBM","XGB"), lens),
                  fpr=c(FPR,fpr.rf,fpr.gbm,fpr.xgb),
                  tpr=c(TPR,tpr.rf,tpr.gbm,tpr.xgb))
# ROC curves
p <- ggplot(data = roc, mapping = aes(x=fpr, y=tpr, colour=Model)) +
  geom_line() + xlab(out.step$perf@x.name) + ylab(out.step$perf@y.name) +
  geom_line(mapping = aes(x=seq(0,1,length.out = sum(lens)),
                          y=seq(0,1,length.out = sum(lens))),
            colour="gray", linetype=2)
ggplotly(p)
# AUC
allAuc <- data.frame(Model=c("LR","RF","GBM","XGB"),
                     AUC=c(out.step$cvAUC,
                           performance(pr.forest, measure = "auc")@y.values[[1]],
                           performance(pr.gbm, measure = "auc")@y.values[[1]],
                           performance(pr.xgb, measure = "auc")@y.values[[1]]))
allAuc=allAuc[order(allAuc$AUC, decreasing = T),]
```

### Notes
- Here logistic regression is compared with three (ensemble) tree models
- Variable importance was also produced for XGBoost; it ranks **Utilization / days90 / days30_59 / days60_89** as the most important variables
- The AUC values of the four models, from highest to lowest, are:
`r kable(allAuc,format = "markdown",align = "c")`

5. WOE Transformation
=======================================================================

Column {data-width=700}
-----------------------------------------------------------------------

### Variable binning

```{r}
### Binning of continuous variables
library(discretization)
nDat.list <- as.list(nDat[,-1])
cut.points <- lapply(nDat.list, FUN = function(x){
  p=cutPoints(x, y=nDat$Dlqin2yrs)  # entropy-based (MDL) cut points
  if(length(p)<=2){
    p=unique(as.numeric(quantile(x,c(0,1/4,2/4,3/4,1))))  # quartile (equal-frequency) fallback
    if(length(p)<=4){
      return("Category")  # treat as categorical; handle manually
    } else {
      return(p)
    }
  }
  c(min(x), sort(p), max(x))
})
nDat.group=nDat
catgry=c()
for(i in 1:length(cut.points)){
  if(cut.points[[i]][1]=="Category") {
    catgry=c(catgry,names(cut.points[i]))
    next
  }
  nDat.group[,i+1]=as.integer(cut(nDat.list[[i]],
                                  breaks = cut.points[[i]],
                                  include.lowest = TRUE))
}

### Manual handling of categorical variables
#sapply(nDat.list[catgry],table)
nDat.group$days30_59[nDat.group$days30_59>6] <- 6
nDat.group$days90[nDat.group$days90>4] <- 4
nDat.group$RealEstate[nDat.group$RealEstate>6] <- 6
nDat.group$days60_89[nDat.group$days60_89>3] <- 3
nDat.group$Dependents[nDat.group$Dependents>5] <- 5
kable(sample_n(nDat.group,20), format = "markdown", align = "c")
```

### WOE transformation

```{r}
library(woe)
woe.all <- sapply(names(nDat.group)[-1], simplify = F, USE.NAMES = T,
                  function(x){
  temp = woe(Data = nDat.group, Independent = x, Continuous = F,
             Dependent = "Dlqin2yrs",
             C_Bin = length(unique(nDat.group[,x])),
             Bad = 0, Good = 1)[,c("BIN","BAD%","GOOD%")]
  temp$`BAD%`=ifelse(temp$`BAD%`<0.001, 0.001, temp$`BAD%`)
  temp$WOE = log(temp$`GOOD%`/temp$`BAD%`)  # delinquent (1) / on-time (0)
  temp[,c("BIN","WOE")]
})
#woe.all
nDat.woe = nDat.group
for(v in names(nDat.group)[-1]){
  nDat.woe = merge(nDat.woe, woe.all[v][[1]], by.x=v, by.y="BIN", all=F)
  nDat.woe[,v]= nDat.woe$WOE
  nDat.woe$WOE <- NULL
}
aa=sample_n(nDat.woe,20)
for(i in 1:10){
  aa[,i]=round(aa[,i],4)
}
kable(aa, format = "markdown", align = "c")
```

Column
-----------------------------------------------------------------------

### WOE values for each variable

```{r fig.align="center"}
woe.df=data.frame()
for(i in 1:length(woe.all)){
  temp=woe.all[[i]]; temp$Feature=names(woe.all[i])
  woe.df=rbind(woe.df,temp)
}
p <- ggplot(data = woe.df, aes(x=Feature, y=WOE, colour=BIN)) +
  geom_point(size=3, alpha = 0.8)
p <- p + theme(axis.text.x = element_text(angle = 45)) + ggtitle("Vars with WOE")
ggplotly(p)
```

6. Score Card
=======================================================================

Column
-----------------------------------------------------------------------

### Standard scorecard model

```{r}
last.lm <- glm(Dlqin2yrs~., nDat.woe, family = "binomial")
```
1. The logistic regression coefficients are:
`r coef(last.lm)`

2. Recall that the general form of a logistic regression equation is: $$ \log(odds)=\log(\frac{p}{1-p})=\beta_0+\beta_1*x_1+\beta_2*x_2+...+\beta_9*x_9+\beta_{10}*x_{10} $$
3. In our analysis, the corresponding equation is: $$ \log(odds)=\log(\frac{p}{1-p})=-0.0290+0.6746*Dependents+0.5690*days60.89+0.6982*RealEstate+0.5904*days90+\\ 0.2174*OpenCredit+0.3615*Income+1.0696*DebtRatio+0.6254*days30.59+0.3928*Age+0.6825*Utilization $$ Note that each variable name here stands for that variable's WOE value, and $$p=P_{(Dlqin2yrs=1)}$$
4. After the WOE transformation, the equation takes the form: $$ \log(odds)=\log(\frac{p}{1-p})=\beta_0+\\ \beta_1*(w_{11}*\delta_{11}+w_{12}*\delta_{12}+...)+\\ \beta_2*(w_{21}*\delta_{21}+w_{22}*\delta_{22}+...)+...+\\ \beta_{10}*(w_{10,1}*\delta_{10,1}+w_{10,2}*\delta_{10,2}+...) $$
5. The score assigned to the j-th level of variable xi is therefore βi*wij. Here δ is a binary indicator taking only the values 0 and 1; within each xi, a record has exactly one level equal to 1 (all others 0), and δij = 1 means the record takes the j-th level of xi. This equation is known as the standard scorecard model.
Column {.tabset}
-----------------------------------------------------------------------

### Scorecard creation and implementation
1. Scale transformation formula:
$${\rm Score} = A-B*\log(odds)$$
Common assumptions: (1) a default-to-normal odds of 1:15 corresponds to 600 points; (2) each doubling of the odds deducts 20 points. Solving with the table below gives A = 521.87, B = 28.85.
```{r}
p = function(x) x/(x+1)
scales <- data.frame(Score = 720-20*1:9,
                     Odds = paste("1 :", 480/(2^(0:8))),
                     p = p((1/480)*(2^(0:8))))
kable(scales, format = "markdown", align = "c")
```
2. Combining the logistic regression equation with the scale formula yields the score expression:
$$ {\rm Score} = A-B\beta_0-\\ B\beta_1*(w_{11}*\delta_{11}+w_{12}*\delta_{12}+...)-\\ B\beta_2*(w_{21}*\delta_{21}+w_{22}*\delta_{22}+...)-...-\\ B\beta_{10}*(w_{10,1}*\delta_{10,1}+w_{10,2}*\delta_{10,2}+...) $$
### Final scorecard

```{r}
beta <- data.frame(coef(last.lm))
beta$Feature = row.names(beta)
beta = beta[-1,]
row.names(beta) <- NULL; names(beta)[1] <- "beta"
woe.df.beta <- merge(woe.df, beta, by = "Feature", all = F)
woe.df.beta$Score = -28.85*woe.df.beta$WOE*woe.df.beta$beta
#woe.df.beta
```
### Application to the Shaoxing project
```{r}
sdry = strsplit(x = '年龄
职业
婚姻状况
受教育程度
出入境特征
是否吸毒前科
是否其他类前科
入住、退房时间不正常
上网时间不正常
与涉毒人员同住一个旅馆
与涉毒人员同乘一个航班
与其他前科人员同住一个旅馆
与其他前科人员同乘一个航班
是否经常本地人入住本地旅馆', split='\n')[[1]]
qcry = strsplit(x = '年龄
职业
婚姻状况
受教育程度
是否前科
与前科人员同住一个旅馆
与前科人员同乘一个航班
入住、退房时间不正常
上网时间不正常
是否经常本地人入住本地旅馆
过往在街道路面被盘查次数', split='\n')[[1]]
yqry = b
yqry[1] = paste0(yqry[1],"(y)")
kable(data.frame(`银行逾期还款人员` = c(yqry,'···','···','···','···','···'),
                 `涉毒类重点人员` = c('未来XX时间内是否会吸毒(y)',sdry,'···'),
                 `侵财类重点人员` = c('未来XX时间内是否会侵财(y)',qcry,'···','···','···','···')))
```